近年来,随着新颖的策略和应用,神经网络一直在迅速扩展。然而,尽管不可避免地会针对关键应用程序来解决这些挑战,例如神经网络技术诸如神经网络技术中仍未解决诸如神经网络技术的挑战。已经尝试通过用符号表示来表示和嵌入域知识来克服神经网络计算中的挑战。因此,出现了神经符号学习(Nesyl)概念,其中结合了符号表示的各个方面,并将常识带入神经网络(Nesyl)。在可解释性,推理和解释性至关重要的领域中,例如视频和图像字幕,提问和推理,健康信息学和基因组学,Nesyl表现出了有希望的结果。这篇综述介绍了一项有关最先进的Nesyl方法的全面调查,其原理,机器和深度学习算法的进步,诸如Opthalmology之类的应用以及最重要的是该新兴领域的未来观点。
translated by 谷歌翻译
在低光环境中捕获的图像经常遭受复杂的降级。简单地调整光不可避免地导致隐藏噪声和颜色失真的突发。从退化投入寻求满足的照明,清洁和现实主义的结果​​,这篇论文提出了一种灵感来自分界和规则原则的新颖框架,大大减轻了退化纠缠。假设图像可以被分解成纹理(具有可能的噪声)和颜色分量,可以具体地执行噪声去除和颜色校正以及光调节。为此目的,我们建议将来自RGB空间的图像转换为亮度色度。可调节的噪声抑制网络设计用于消除亮度亮度的噪声,其具有估计的照明图以指示噪声升高水平。增强型亮度进一步用于色度映射器的指导,以产生现实颜色。进行了广泛的实验,揭示了我们设计的有效性,并在几个基准数据集上展示了定量和定性的最先进的替代方案的优势。我们的代码在HTTPS://github.com/mingcv/bread下公开提供。
translated by 谷歌翻译
Transformer-based models have been widely demonstrated to be successful in computer vision tasks by modelling long-range dependencies and capturing global representations. However, they are often dominated by features of large patterns leading to the loss of local details (e.g., boundaries and small objects), which are critical in medical image segmentation. To alleviate this problem, we propose a Dual-Aggregation Transformer Network called DuAT, which is characterized by two innovative designs, namely, the Global-to-Local Spatial Aggregation (GLSA) and Selective Boundary Aggregation (SBA) modules. The GLSA has the ability to aggregate and represent both global and local spatial features, which are beneficial for locating large and small objects, respectively. The SBA module is used to aggregate the boundary characteristic from low-level features and semantic information from high-level features for better preserving boundary details and locating the re-calibration objects. Extensive experiments in six benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods in the segmentation of skin lesion images, and polyps in colonoscopy images. In addition, our approach is more robust than existing methods in various challenging situations such as small object segmentation and ambiguous object boundaries.
translated by 谷歌翻译
In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
translated by 谷歌翻译
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
translated by 谷歌翻译
Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study could draw more attention from the community on developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging the off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
translated by 谷歌翻译
With the advanced request to employ a team of robots to perform a task collaboratively, the research community has become increasingly interested in collaborative simultaneous localization and mapping. Unfortunately, existing datasets are limited in the scale and variation of the collaborative trajectories, even though generalization between inter-trajectories among different agents is crucial to the overall viability of collaborative tasks. To help align the research community's contributions with realistic multiagent ordinated SLAM problems, we propose S3E, a large-scale multimodal dataset captured by a fleet of unmanned ground vehicles along four designed collaborative trajectory paradigms. S3E consists of 7 outdoor and 5 indoor sequences that each exceed 200 seconds, consisting of well temporal synchronized and spatial calibrated high-frequency IMU, high-quality stereo camera, and 360 degree LiDAR data. Crucially, our effort exceeds previous attempts regarding dataset size, scene variability, and complexity. It has 4x as much average recording time as the pioneering EuRoC dataset. We also provide careful dataset analysis as well as baselines for collaborative SLAM and single counterparts. Data and more up-to-date details are found at https://github.com/PengYu-Team/S3E.
translated by 谷歌翻译
多模式变压器的最新努力通过合并视觉和文本信息改善了视觉上丰富的文档理解(VRDU)任务。但是,现有的方法主要集中于诸如单词和文档图像贴片之类的细粒元素,这使得他们很难从粗粒元素中学习,包括短语和显着视觉区域(如突出的图像区域)等自然词汇单元。在本文中,我们对包含高密度信息和一致语义的粗粒元素更为重要,这对于文档理解很有价值。首先,提出了文档图来模拟多层次多模式元素之间的复杂关系,其中通过基于群集的方法检测到显着的视觉区域。然后,提出了一种称为mmlayout的多模式变压器,以将粗粒的信息纳入基于图形的现有预训练的细颗粒的多峰变压器中。在mmlayout中,粗粒信息是从细粒度聚集的,然后在进一步处理后,将其融合到细粒度中以进行最终预测。此外,引入常识增强以利用天然词汇单元的语义信息。关于四个任务的实验结果,包括信息提取和文档问答,表明我们的方法可以根据细粒元素改善多模式变压器的性能,并使用更少的参数实现更好的性能。定性分析表明,我们的方法可以在粗粒元素中捕获一致的语义。
translated by 谷歌翻译
物流运营商最近提出了一项技术,可以帮助降低城市货运分销中的交通拥堵和运营成本,最近提出了移动包裹储物柜(MPLS)。鉴于他们能够在整个部署领域搬迁,因此他们具有提高客户可访问性和便利性的潜力。在这项研究中,我们制定了移动包裹储物柜问题(MPLP),这是位置路由问题(LRP)的特殊情况,该案例确定了整天MPL的最佳中途停留位置以及计划相应的交付路线。开发了基于混合Q学习网络的方法(HQM),以解决所得大问题实例的计算复杂性,同时逃脱了本地Optima。此外,HQM与全球和局部搜索机制集成在一起,以解决经典强化学习(RL)方法所面临的探索和剥削困境。我们检查了HQM在不同问题大小(最多200个节点)下的性能,并根据遗传算法(GA)进行了基准测试。我们的结果表明,HQM获得的平均奖励比GA高1.96倍,这表明HQM具有更好的优化能力。最后,我们确定有助于车队规模要求,旅行距离和服务延迟的关键因素。我们的发现概述了MPL的效率主要取决于时间窗口的长度和MPL中断的部署。
translated by 谷歌翻译
大型视觉基础模型在自然图像上的视觉任务上取得了重大进展,在这种情况下,视觉变压器是其良好可扩展性和表示能力的主要选择。但是,在现有模型仍处于小规模的情况下,遥感社区(RS)社区中大型模型的利用仍然不足,从而限制了性能。在本文中,我们使用约1亿个参数求助于普通视觉变压器,并首次尝试提出针对RS任务定制的大型视觉模型,并探索如此大型模型的性能。具体而言,要处理RS图像中各种取向的较大图像大小和对象,我们提出了一个新的旋转型尺寸的窗户注意力,以替代变形金刚中的原始关注,这可以大大降低计算成本和内存足迹,同时学习更好的对象通过从生成的不同窗口中提取丰富上下文来表示。关于检测任务的实验证明了我们模型的优越性,超过了所有最新模型,在DOTA-V1.0数据集上实现了81.16 \%地图。与现有的高级方法相比,我们在下游分类和细分任务上的模型结果也证明了竞争性能。进一步的实验显示了我们模型对计算复杂性和几乎没有学习的优势。代码和模型将在https://github.com/vitae-transformer/remote-sensing-rvsa上发布
translated by 谷歌翻译